Reducing packet size in a communication protocol

ABSTRACT

In one embodiment, the present invention includes a processor that can generate data packets for transmission to an agent, where the processor can generate a data packet having a command portion including a first operation code to encode a transaction type for the data packet and a second operation code to encode a processor-specific operation. This second operation code can encode many different features such as an indication that the data packet is of a smaller size than a standard packet, in order to reduce bandwidth. This operation code can also identify an operation to be performed by a destination agent coupled to the agent. Other embodiments are described and claimed.

This application is a continuation of U.S. patent application Ser. No. 12/748,644, now U.S. Pat. No. 8,473,567, filed Mar. 29, 2010, the content of which is hereby incorporated by reference.

BACKGROUND

Modern computer systems are realized by the interconnection of various components including processors, memory devices, peripheral devices and so forth. To enable communication between these different components, various links may be present to interconnect one or more of the devices together. Systems can include many different types of interconnects or links. Typically, there is a given communication protocol for each particular type of link, and communications occurring on such link are according to this protocol.

In general, a communication protocol provides for a relatively standard manner of communicating information, e.g., by way of data packets that are formed in one agent for communication to another agent. Typical data packets include a so-called header portion that may include command and other control information and a payload portion that includes data associated with the packet. Typical communication protocols for point-to-point communication in shared memory multiprocessor systems provide for a fixed data packet size. However, such fixed data packet size can unnecessarily consume interconnect bandwidth. Assume for example that a communication protocol dictates that a data packet size is a cache line size. The most common cache line sizes in use are 64 bytes and 128 bytes. However, if an agent seeks to send a lesser amount of bytes, e.g., 8 bytes, the full 64 or 128 byte data packet size is still transmitted, thus needlessly consuming bandwidth.
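The bandwidth cost described above can be made concrete with a short calculation. The following is only an illustrative sketch; the function name `payload_utilization` is hypothetical and does not appear in the specification.

```python
def payload_utilization(useful_bytes: int, packet_payload_bytes: int) -> float:
    """Fraction of a fixed-size payload that carries requested data."""
    if not 0 < useful_bytes <= packet_payload_bytes:
        raise ValueError("useful_bytes must be in 1..packet_payload_bytes")
    return useful_bytes / packet_payload_bytes


# An 8 byte transfer carried in a cache-line-sized payload (64 or 128 bytes).
for cache_line in (64, 128):
    share = payload_utilization(8, cache_line)
    print(f"8 of {cache_line} bytes: {share:.2%} used, {1 - share:.2%} wasted")
```

Under these numbers, an 8 byte transfer uses 12.5% of a 64 byte payload and 6.25% of a 128 byte payload.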

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a platform in accordance with an embodiment of the present invention.

FIG. 2 is a block diagram of a platform in accordance with another embodiment of the present invention.

FIG. 3 is an example packet format in accordance with an embodiment of the present invention.

FIG. 4 is a flow diagram of a method for performing a memory access transaction in accordance with one embodiment of the present invention.

FIG. 5 is a flow diagram of a method in accordance with another embodiment of the present invention.

FIG. 6 is a flow diagram for a remote read operation in accordance with one embodiment of the present invention.

FIG. 7 is a flow diagram for a write operation to a remote agent in accordance with one embodiment of the present invention.

FIG. 8 is a flow diagram for a prefetch operation to a remote node in accordance with one embodiment of the present invention.

FIG. 9 is a block diagram of a processor in accordance with one embodiment of the present invention.

DETAILED DESCRIPTION

In various embodiments, techniques are provided to enable communication of data transactions that include data portions less than a standard data packet size for a given communication protocol. In this way, the interconnect overhead of sending a transaction that requires less than the full amount of data payload dictated by a communication protocol can be avoided. Still further, processing resources can be more fully used, as the processing complexity needed to handle a small amount of data present in a larger packet size can be avoided. That is, in conventional systems, a larger data packet can be stuffed with don't cares or a transaction may require a full data packet with byte enables. In these cases considerable link bandwidth is wasted, which further underutilizes processor compute capabilities.

While the scope of the present invention is not limited in this regard, embodiments may be used in connection with a coherent communication protocol such as a serial-based point-to-point (PtP) communication protocol. One such example is the Intel™ Quick Path Interconnect (QPI) communication protocol; of course, embodiments may be equally used with other interconnect technologies.

In general, a communication protocol may be optimized for desktop and server platforms and accordingly may implement a fixed data packet size for common workloads on such platforms. For example, a communication protocol in accordance with an embodiment of the present invention may dictate a standard data packet size to communicate 64 byte data portions. This 64 byte data portion may be segmented into a plurality of individual flow control units, referred to as flits. In addition, a data packet may further include a header portion including some number of command flits.

While such data packets may be suitable for many applications on desktop and server platforms, for some applications on such platforms and for different types of platforms such as high performance computing (HPC) platforms, this packet format can be very inefficient in transferring smaller (e.g., 8-byte) data packets. As examples, HPC-specific operations of a limited data payload size include loads, prefetches from remote nodes, and stores and atomic memory operations at remote nodes. Embodiments may provide flexibility in the size of the data payloads for a packet that is transmitted along an interconnect such that more efficient data communication can occur. Requests for smaller data sizes from remote nodes are expected to become more prevalent as the partitioned global address space (PGAS) programming paradigm gains a greater following in the HPC community.

Embodiments may enable further improvements in communication efficiency by providing for a data transfer via a so-called atomic memory operation (AMO). An AMO involves transfer from one agent to another (typically of a remote node) of a data operand along with an operation that is to be performed on this operand, and a reference to another data operand, which can be obtained by the second agent, e.g., via a memory read to a memory associated with the second agent. In some embodiments, the data operand sent with the AMO may be of smaller size than the conventional data payload size for the communication protocol (e.g., an 8 byte data operand sent along an interconnect according to a protocol that calls for 64 byte payloads).

To enable transactions to occur with transmission of data packets having payloads less than a standard payload size for a given communication protocol, various fields may be included in command portions of a packet that enable a packet format having a payload portion less than the standard payload size. As will be discussed below, in one embodiment extensions to existing request types can be provided to enable these smaller data packets.

Embodiments may be used in connection with many different types of systems. However, particular implementations may be for an HPC-type platform where many nodes (e.g., thousands of nodes) may be interconnected to provide computing capability for HPC applications. Referring now to FIG. 1, shown is a block diagram of an HPC platform in accordance with an embodiment of the present invention. As shown in FIG. 1, system 100 may be an HPC platform that includes a plurality of nodes 110₀-110₂. Although shown with only three nodes for illustration, understand that in many implementations many hundreds or thousands of nodes may be present. As seen, each node may include various processing capabilities, including a plurality of processors. Specifically, each node includes two processor sockets 112₁-112₂. Each processor socket may be coupled to one or more portions of a local memory which in one embodiment may be dynamic random access memory. Specifically, processors 112 may be coupled to local memories 114₁-114₄. In the implementation of FIG. 1, multiple PtP links 115 may be provided between each processor and a node controller or network interface controller (NIC) 120. Communications on such links 115 may be via a PtP communication protocol, and may be within a coherent domain of this communication protocol.

To enable communications with other nodes (not shown in FIG. 1), a coherence domain 130 (e.g., of an original equipment manufacturer (OEM)) may be provided and communications with each NIC 120 may occur by way of an interconnect 125. These interconnects may in turn couple to an interconnect or fabric 135, such as the fabric of a given OEM.

FIG. 2 shows a similar multi-node HPC platform. However, in the implementation of FIG. 2, system 100′ includes PtP interconnects 115 that may directly couple processor sockets within a node. In other respects, system 100′ may be configured similarly to that discussed above regarding FIG. 1.

Embodiments may provide for optimal utilization of the PtP protocol used in the PtP links within the nodes of an HPC or other platform. Specifically, data transfer transactions (read or write) of very small data words (e.g., 8 bytes) between remote computing nodes can be performed while maximizing the bandwidth utilization on the links connecting processors or other agents to the NIC.

While the scope of the present invention is not limited in this regard, remote memory operations that may be implemented using a reduced data payload size include partial memory reads and writes, and an AMO. In one embodiment, non-coherent partial read command (NcRdPt1) semantics can be used to initiate a read transaction, and may result in receipt of a data return packet of, e.g., 8 bytes, referred to herein as DataNc8. In one embodiment, a partial memory write can be implemented using write combining partial write command (WcWrPt1) semantics having fields to indicate the reduced data payload.

In one embodiment, an AMO can be used to forward an operand of, e.g., 8 bytes along with a requested operation to a remote agent, which may read the second operand from a memory (e.g., associated with the remote agent). In one such embodiment, a memory controller associated with the memory may perform the requested operation using the two data operands and directly write back the result to the memory. In one embodiment, non-coherent input/output write command (NcIOWr) semantics can be used for this operation.

To enable these operations, fields in header portions of certain packets can be used to expand the addressing capability to distributed memory at, e.g., tens of thousands of nodes, and to manage individual thread communication more efficiently.

Referring now to FIG. 3, shown is an example data packet in accordance with an embodiment of the present invention. As seen in FIG. 3, data packet 150 includes a command header and a data portion, which may be a minimal sized data packet in accordance with an embodiment of the present invention and may include various fields to indicate different information relevant to a transaction. As to the format of the command header, shown is a set of lanes, each of which may be communicated on an individual differential pair of serial interconnects between two agents. In the embodiment shown in FIG. 3, 18 lanes of data information are provided, in addition to two checksum lanes to provide checksum information, e.g., a cyclic redundancy checksum (CRC). As seen further in FIG. 3, the command header may be two flits, with each flit formed of multiple 20-bit portions referred to as phits or physical layer units.

As seen, various information is present to provide details regarding a transaction. Nonetheless, only certain of the fields are discussed in detail below. Such information stored in the different fields may include information regarding addressing, transaction encoding (by way of an operation code, also referred to as an opcode), virtual channel information and so forth. In addition, embodiments may include a second or additional operation code field to provide information regarding a specific command to be performed, e.g., for an operation to be performed with a smaller data payload than a standard payload of a given communication protocol. In one embodiment, this field may be a 3-bit field to specify a processor-specific operation to be performed by a receiving agent (e.g., a NIC that receives a request from a processor or an actual destination of the packet). In addition, yet another operation code field may be provided to include information regarding a specific OEM command (e.g., usable to send a command from a NIC to a processor or other agent). In one embodiment, this OEM-defined command field may be a 3-bit field set by an OEM device such as a NIC to specify a command to the processor. In one embodiment, this field may be set to a value of zero for messages sent from a processor or other agent to the NIC or other OEM device. Thus FIG. 3 shows two sets of opcodes, namely a regular opcode[3:0] that is used to denote various QPI-command types and an HPC-opcode[3:0] that is used to distinguish between GSA-stores, GSA-prefetches and AMOs. All three operations may use the same command of the regular opcode, with different HPC-opcode values distinguishing among them.

Still further, extended addressing information (e.g., address bits 63:51) may be present within the command header to enable addressing of a very large number of interconnected nodes, as discussed above. Embodiments may further include fields for various information to enable better thread-level management. Such thread-based fields may include a core identifier (e.g., 6 bits) to specify an identifier (ID) of a processor core that issues a request. In one embodiment, this field information may be used by a receiving agent (such as a NIC) to track the source of a request. Another such field may include a thread identifier (e.g., 5 bits) to specify an ID of a thread running on the processor core that issued the request. Again this field may be used by the NIC or other receiving agent to track the source of the request. Still further, a privilege level (e.g., 2 bits) may provide an indication of the privilege level of the thread running on the core.
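As an illustration of how the additional header fields might be carried, the following sketch packs them into a single integer and unpacks them again. The field widths are those given above for one embodiment (3-bit processor-specific opcode, 3-bit OEM-defined command, 6-bit core ID, 5-bit thread ID, 2-bit privilege level, and 13 bits for address bits 63:51). The field names, their ordering, and the bit positions are assumptions made for this example and are not taken from FIG. 3.

```python
# (name, width in bits). Widths follow the text; names and ordering are hypothetical.
FIELDS = (
    ("hpc_opcode", 3),   # processor-specific operation code
    ("oem_opcode", 3),   # OEM-defined command, zero for processor-to-NIC messages
    ("core_id", 6),      # ID of the requesting core
    ("thread_id", 5),    # ID of the requesting thread
    ("privilege", 2),    # privilege level of the thread
    ("addr_63_51", 13),  # extended address bits 63:51
)


def pack_fields(values: dict) -> int:
    """Pack field values into one integer, lowest-listed field in the low bits."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack_fields(word: int) -> dict:
    """Inverse of pack_fields."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out
```

A receiving agent such as a NIC could, under these assumptions, call `unpack_fields` and use `core_id` and `thread_id` to track the source of a request.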

In some embodiments, a different command header format may include, instead of a length field such as shown in FIG. 3, a byte enable field to indicate which of a plurality of bytes of a data portion are to be enabled. Still further, other data packet formats may have a command header that is a single flit along with a single flit data portion (e.g., formed of 8 bytes or less). Such a data packet can be used as a response or return message to provide data responsive to a read request for a limited data amount (e.g., 8 bytes or less).

Table 1 below shows example encodings for processor-specific and OEM-specific encodings, each of which may be a separate field within a command header (and in addition to a standard encoding for a transaction of the communication protocol).

TABLE 1

Field Name: Processor-Specific Encoding
Standard Command: NcRdPt1
Encoding:
  000: Standard transaction with EA Header and Address[63:51] = 0
  001: Partial read transaction; length of the request determined by Length[5:0] field
  010-111: Reserved
Notes: 9 flits for standard transaction; 2 flits for processor-specific opcode = 001

Field Name: Processor-Specific Encoding
Standard Command: WcWrPt1
Encoding:
  000: Standard transaction with EBDW Header and Address[63:51] = 0
  001: Partial write transaction; byte mask indicates bytes to be written
  010-111: Reserved
Notes: 1 flit completion for processor-specific opcode = 001

Field Name: Processor-Specific Encoding
Standard Command: NcIOWr
Encoding:
  000: Standard transaction with EIC Header and Address[63:51] = 0
  001: AMO or <= 8 byte write on 8 byte address aligned boundary
  010: Prefetch request; data bytes are don't care
  011-111: Reserved
Notes: For AMOs with a completion

Field Name: OEM Defined Encoding
Standard Command: NcIOWr
Encoding: Encodings to cover basic logical and arithmetic operations
Notes: 1 flit completion
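One way a receiving agent might interpret the processor-specific encodings of Table 1 is with a lookup keyed by the standard command and the 3-bit encoding. The sketch below is illustrative only: the dictionary name, the function name `decode_processor_specific`, and the description strings are assumptions for this example, while the encodings themselves are those listed in Table 1.

```python
# Processor-specific encodings from Table 1, keyed by standard command.
# Any 3-bit encoding not listed for a command is reserved.
PROCESSOR_SPECIFIC = {
    "NcRdPt1": {
        0b000: "standard transaction",
        0b001: "partial read (length from Length[5:0])",
    },
    "WcWrPt1": {
        0b000: "standard transaction",
        0b001: "partial write (byte mask selects bytes)",
    },
    "NcIOWr": {
        0b000: "standard transaction",
        0b001: "AMO or <= 8 byte write on 8 byte aligned boundary",
        0b010: "prefetch request (data bytes are don't care)",
    },
}


def decode_processor_specific(command: str, encoding: int) -> str:
    """Return a description of a 3-bit processor-specific encoding."""
    if not 0 <= encoding <= 0b111:
        raise ValueError("encoding must fit in 3 bits")
    if command not in PROCESSOR_SPECIFIC:
        raise KeyError(f"no processor-specific encodings listed for {command}")
    return PROCESSOR_SPECIFIC[command].get(encoding, "reserved")
```

For example, `decode_processor_specific("NcIOWr", 0b010)` identifies a prefetch request, while any unlisted value for a command is reported as reserved.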

Embodiments may thus be used to perform different types of transactions including memory access transactions such as read and write requests, in addition to atomic memory operations and other operations. Such other operations may include processor-specific operations that enable a processor such as an HPC processor to perform transactions not normally supported by a given communication protocol. As examples, such transactions may include accessing short data items from remote nodes, performing operations atomically on data at remote agents or fetching data from remote agents to some storage location nearer to the initiator of the request. In addition, embodiments may further enable OEM-specific transactions to occur. That is, particular encodings, e.g., of the OEM-specific operation code field may enable an OEM to perform specific transactions that are similarly not supported by a given communication protocol.

Referring now to FIG. 4, shown is a flow diagram of a method for performing a memory access transaction in accordance with one embodiment of the present invention. As shown in FIG. 4, method 200 may be implemented, e.g., by a system agent that communicates with another agent such as a NIC via a link according to a given communication protocol such as a coherent PtP protocol. Method 200 may begin by receiving a request for a memory transaction for a minimal amount of data (block 210). That is, a request, e.g., from a processor core may be to load or store a small amount of data, namely an amount of data smaller than a standard data payload size for the communication protocol.

At diamond 220 it may be determined whether this amount of data is less than or equal to a threshold. Although the scope of the present invention is not limited in this regard, in one embodiment this threshold may be eight bytes. In this way, if the amount of data associated with the transaction is less than or equal to eight bytes, the memory access transaction can occur with a minimal number of flits (fewer flits than for a conventional transaction of the communication protocol). If it is determined at diamond 220 that the amount is greater than the threshold, control passes to block 230 where a packet may be generated for the transaction including a standard memory transaction opcode. Thus the transaction may be sent and processed in accordance with the standard communication protocol processing flows.

If instead at diamond 220 it is determined that the data amount is less than or equal to the threshold, control passes to block 240. At block 240 a packet may be generated for the transaction that includes both the standard memory transaction opcode as well as a special opcode to indicate the presence of a minimal data payload. Control then passes to diamond 250 where it is determined whether the transaction is for a write request. If so, control passes to block 260 where a packet including a minimal data payload may be transmitted. For example, assuming that the data portion is less than or equal to eight bytes, the packet for this memory transaction may only be two flits (one flit header and one flit data payload). In contrast, for an example communication protocol a typical write transaction data packet may be nine flits. After the packet has been transmitted and handled appropriately by the destination, the agent may receive a completion packet (block 270).

Still referring to FIG. 4, if instead at diamond 250 it is determined that the request is a read request, control passes to block 280 where the read request may be transmitted. Then when the data has been read from the destination location, the agent may receive the requested data in a minimal payload packet (block 290). With reference to the example above, this packet may be a non-coherent read completion of minimal size (e.g., two flits).
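The packet-generation decision of method 200 (diamond 220 through block 240) can be sketched as follows. This is a simplified model under stated assumptions, not the implementation of any embodiment: the `Packet` class, the function `generate_packet`, and the choice to represent the absence of a special opcode as `None` are hypothetical, and the value 0b001 for the special opcode is borrowed from the partial read and partial write encodings of Table 1.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_BYTES = 8            # threshold named for one embodiment
SMALL_PAYLOAD_OPCODE = 0b001   # partial read/write encoding from Table 1


@dataclass
class Packet:
    standard_opcode: str            # standard memory transaction opcode
    special_opcode: Optional[int]   # None models a packet without the special opcode
    payload: bytes                  # empty for read requests


def generate_packet(is_write: bool, length: int, data: bytes = b"") -> Packet:
    """Model diamond 220 and blocks 230/240 of FIG. 4."""
    if is_write and len(data) != length:
        raise ValueError("write data must match the requested length")
    standard = "WcWrPt1" if is_write else "NcRdPt1"
    payload = data if is_write else b""
    if length > THRESHOLD_BYTES:
        # Block 230: standard opcode only, standard processing flow.
        return Packet(standard, None, payload)
    # Block 240: standard opcode plus special opcode marking a minimal payload.
    return Packet(standard, SMALL_PAYLOAD_OPCODE, payload)
```

In this model, an 8 byte write yields a packet carrying both opcodes and an 8 byte payload, whereas a 64 byte request falls through to the standard path.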

Other embodiments may be used to perform atomic memory operations in which at least one operand of the operation is also of a limited size, as compared to a standard payload for a given communication protocol. Referring now to FIG. 5, shown is a flow diagram of a method in accordance with another embodiment of the present invention. As shown in FIG. 5, method 300 may begin by receiving a request in a first agent for an atomic memory operation (block 310). As an example, this request may be received at a home agent of a first node from an agent of a remote node. More specifically, the request may be transmitted from the remote node and provided to, e.g., a NIC of the local node which may in turn provide the request to the home agent. In one embodiment, this request may include a first operand, an indication of the location of another operand, and the requested operation such as a given arithmetic operation. Next, it may be determined whether a memory controller is capable of performing the requested operation (diamond 315). That is, in some implementations a memory controller coupled to a local memory of the node may have one or more execution units such as an arithmetic logic unit (ALU) that is capable of performing the requested operation. Note that while FIG. 5 shows this determination, embodiments may instead be configured to either provide the request to the memory controller or handle the operation within the home agent without making this determination.

If it is determined that the memory controller is not capable of performing the operation, control passes to block 320 where a memory access request may be sent to the memory controller to obtain the second operand. The second operand may then be received from the memory controller (block 325), and accordingly the home agent may perform the requested operation (block 330). Because the request is responsive to an atomic memory operation, note that communications back to the requester agent are not needed, as the operation can be directly performed in the first node.

If instead it is determined that the memory controller has support to handle the operation, control passes from diamond 315 to block 340, where the memory access request, first operand and the requested operation may be sent to the memory controller. Accordingly, the memory controller may obtain the second operand from memory and perform the operation (block 345).

Control passes to block 350 from both of blocks 330 and 345 such that the result of the operation is stored in the memory directly without the need for a communication of any result back to the requester agent. At this time, the operation is complete and a completion message may be sent back to the remote agent (block 360). While shown with this particular implementation in the embodiment of FIG. 5, the scope of the present invention is not limited in this regard.
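The two paths of method 300 can be sketched in a few lines. The sketch below models memory as a dictionary of 8 byte values and is illustrative only: the `MemoryController` class, the `handle_amo` function, the set of supported operations, and the string used for the completion message are all assumptions for this example.

```python
import operator

MASK64 = (1 << 64) - 1  # operands are modeled as 8 byte values
OPS = {"add": operator.add, "and": operator.and_,
       "or": operator.or_, "xor": operator.xor}


class MemoryController:
    """Toy memory controller; has_alu models support for the operation."""

    def __init__(self, memory: dict, has_alu: bool):
        self.memory = memory
        self.has_alu = has_alu

    def read(self, addr: int) -> int:
        return self.memory[addr]

    def write(self, addr: int, value: int) -> None:
        self.memory[addr] = value & MASK64

    def execute_amo(self, addr: int, operand: int, op: str) -> None:
        # Blocks 345 and 350: read second operand, operate, store result.
        self.write(addr, OPS[op](self.read(addr), operand))


def handle_amo(mc: MemoryController, addr: int, first_operand: int, op: str) -> str:
    """Home-agent handling of an AMO request, following FIG. 5."""
    if mc.has_alu:                                   # diamond 315
        mc.execute_amo(addr, first_operand, op)      # blocks 340, 345, 350
    else:
        second_operand = mc.read(addr)               # blocks 320, 325
        result = OPS[op](second_operand, first_operand)  # block 330
        mc.write(addr, result)                       # block 350
    return "completion"                              # block 360
```

Either path leaves the same result in memory, and only the completion message, not the result, is returned toward the requester.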

FIGS. 6-8 illustrate example flows for handling operations in accordance with different embodiments. FIG. 6 depicts transaction flows for a GSA-load (reading data from a remote node), FIG. 7 depicts transactions for AMO operation at the remote node and FIG. 8 depicts transactions for GSA-prefetch (prefetching data from a remote node). Note that these figures do not include flow details that are not relevant for the described embodiments. As seen, these Feynman diagrams show a set of protocol agents at the local/requesting node and a set at the destination or home memory node. The agents are a caching agent, which is an agent having the ability to cache data, a link agent such as a link layer that can generate link messages, NICs at the source and destination nodes, a home agent (which is an agent that owns a targeted region of memory) and a memory controller.

With reference to FIG. 6, a flow for a remote 8 byte read is shown. As seen, the requesting caching agent forwards the read request to the local NIC (SrcNIC) which in turn forwards the request to the remote NIC (DestNIC). This NIC forwards the request to the home agent. In turn, the home agent performs local snooping (if required) and also performs a memory read. Note in this regard that the home agent thus uses a non-coherent memory access request (NcRdPt1) semantic, avoiding the need for the NIC to perform coherency checks over the link, and takes the responsibility of converting this request into a coherent memory request (owing to the snoop operation). Accordingly, embodiments provide for the ability to perform a coherent operation responsive to a request using non-coherent request semantics.

Still referring to FIG. 6, the data read from memory is then sent back to the requesting caching agent through the same path as the request. Note that by using a special operation code in accordance with an embodiment of the present invention for a limited data size packet, the read request and data return messages can be sent with limited payload packets, thus using only 4 flits for the request path and 4 for the return path, instead of a total of 21 flits according to a standard communication protocol.
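The flit counts quoted above for the remote read of FIG. 6 imply the following saving. The short calculation below uses only the figures stated in the text (4 flits on the request path, 4 on the return path, 21 in total for the standard protocol); the function name is hypothetical.

```python
def flit_savings(request_flits: int, return_flits: int, standard_total: int):
    """Return (reduced total, flits saved, fraction of flits saved)."""
    reduced_total = request_flits + return_flits
    saved = standard_total - reduced_total
    return reduced_total, saved, saved / standard_total


reduced_total, saved, fraction = flit_savings(4, 4, 21)
print(f"{reduced_total} flits instead of 21: {saved} saved ({fraction:.1%})")
```

With these figures the reduced flow uses 8 flits rather than 21, a saving of 13 flits or roughly 62% of the link traffic for the transaction.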

With reference to FIG. 7, a flow for an 8 byte write to a remote home agent is shown. This write flow may use non-coherent IO write transaction (NcIOWr) semantics of small size (e.g., 3 flits). As further seen, a completion message may be sent back to the caching agent directly from a source NIC and without waiting for a full memory transaction to complete in the memory controller of the remote agent. Again, a reduced number of flits can be used to send a write request with a reduced data payload. Also a similar flow may be used for an atomic memory operation, and thus the indications from the remote home agent to the caching agent and memory controller of the remote node show possible transactions to perform the atomic memory operation.

Referring now to FIG. 8, shown is a flow for an eight byte prefetch to a remote node, where the data requested, as mentioned earlier, may be prefetched to a storage location nearer to the requester and not necessarily to the requesting processor's caches. As seen, this prefetch at the requesting node may similarly be via non-coherent I/O write request semantics. Again note that a completion may be directly generated by the source NIC upon receipt of the prefetch request, enabling a caching agent to release any resources used by this request early and continue unrelated processing while the data return to storage at the near NIC occurs. Due to the special operation code used, note that a return for a smaller data payload is realized, namely a data return message of 8 bytes (a 2 flit message). While shown with these particular flows in the above examples, understand that the scope of the present invention is not limited in this regard to either the size of the transfer or the location of the storage element.

Referring now to FIG. 9, shown is a block diagram of a processor in accordance with one embodiment of the present invention. As discussed above, implementations may be incorporated into an HPC system including an HPC processor. This processor may include a plurality of cores, generally represented in FIG. 9 as core logic 410. Each such core may be associated with a cache memory 420. For example, each core may be associated with a private cache memory such as a low level cache memory. In turn, the private cache memories may be coupled to a shared cache memory, e.g., as an inclusive cache hierarchy, although the scope of the present invention is not limited in this regard.

As further seen in FIG. 9, processor 400 may further include interconnect logic 430. Such interconnect logic may be logic in accordance with a given communication protocol and may further provide for handling extensions to such protocol, e.g., to transmit data packets having a smaller data payload than the standard for the protocol, or to transmit processor or OEM-specific commands that are unsupported by the communication protocol. As represented in FIG. 9, such logic may include packet generation logic 432 which may be link layer logic to receive requests from the cores and to generate packets for transmission on one or more interconnects that couple the processor to other agents. Such logic may enable communication of packets having multiple operation codes for a single data packet, one of which is a standard operation code, e.g., for a memory access transaction, and the second of which is to provide special information for the handling of this transaction, e.g., to indicate a smaller size of the payload portion of the packet or to provide a specific request such as an atomic memory operation.

Coupled to packet generation logic 432 may be a packet transmission logic 434, which may be physical layer logic to take the generated packet and format it for electrical communication along the interconnect. While not shown for ease of illustration, understand that both packet transmission logic 434 and packet generation logic 432 may have corresponding reception logic and conversion logic to receive incoming packets from the interconnect(s) and process the information to provide it to the one or more cores of the processor. While shown with this high level view in the embodiment of FIG. 9, understand the scope of the present invention is not limited in this regard.

Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.

While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.

What is claimed is:
 1. A method comprising: generating a data packethaving a command portion including a first opcode to encode atransaction type of the data packet and a second opcode to encode aprocessor-specific operation to be performed by a destination agent andcorresponding to a transaction not supported by a communicationprotocol, wherein the command portion corresponds to an atomic memoryoperation; and transmitting the data packet from a processor on a linkcoupled to the processor, the link according to the communicationprotocol, wherein responsive to the atomic memory operation, thedestination agent is to send a memory access request to a memorycontroller coupled to a memory to obtain a second operand, and performan operation responsive to the second opcode on a first operand and thesecond operand in one of the destination agent and the memory controllerto obtain a result.
 2. The method of claim 1, wherein the second operandis of a smaller size than a size of a data portion according to thecommunication protocol.
 3. The method of claim 1, further comprisingreceiving a completion message from the destination agent after theresult is stored in the memory.
 4. The method of claim 1, furthercomprising performing the operation in the memory controller, whereinthe memory controller includes a logic unit to perform the operation,and further comprising sending the first operand and the second opcodeto the memory controller from the destination agent.
 5. The method ofclaim 1, further comprising generating the data packet having a smallersize than a data packet size for the communication protocol.
 6. Aprocessor comprising: a plurality of cores; a link layer logic togenerate a data packet having a command portion including a first opcodeto encode a transaction type of the data packet and a second opcode toencode a processor-specific operation to be performed by a destinationagent and corresponding to a transaction not supported by acommunication protocol, wherein the second opcode is to cause thedestination agent to send a memory access request to a memory controllercoupled to a memory to obtain a second operand, perform an operationresponsive to the second opcode on a first operand included in a dataportion of the data packet and the second operand to obtain a result,and store the result in the memory; and a transmission logic coupled tothe link layer logic to transmit the data packet on a link coupled tothe processor, the link according to the communication protocol.
 7. The processor of claim 6, wherein the link layer logic is to generate the data packet with a data portion having a size smaller than a size of a data portion according to the communication protocol and to set the second opcode to identify the smaller size data portion.
 8. The processor of claim 6, wherein the link layer logic is to transmit the data packet having a number of flow control units, the number of flow control units less than a number of flow control units for a data packet according to the communication protocol.
 9. The processor of claim 6, wherein the second opcode is to cause the destination agent to obtain data from a memory and to communicate the data to the processor via a return data packet having a data portion with a size smaller than a size of a data portion according to the communication protocol.
 10. The processor of claim 6, wherein the second opcode is to cause the destination agent to write data of a data portion of the data packet into a memory associated with the destination agent, wherein the data portion has a size smaller than a size of a data portion according to the communication protocol.
 11. The processor of claim 6, wherein the second opcode is to cause the destination agent to send a completion message to the processor after the result is stored in the memory.
 12. The processor of claim 6, wherein the first opcode is to indicate a memory access transaction and the second opcode is to indicate that data of the memory access transaction is of a size less than a size of a data portion according to the communication protocol.
 13. The processor of claim 12, wherein the memory access transaction is a non-coherent transaction and the destination agent is to convert the non-coherent transaction to a coherent transaction to perform coherent processing responsive to the coherent transaction.
 14. A system comprising: a first processor including a link logic to generate a data packet having a command portion including a first opcode to encode a transaction type of the data packet and a second opcode to encode a processor-specific operation to be performed by a second processor and corresponding to a transaction not supported by a communication protocol; and the second processor coupled to the first processor to receive the data packet and to perform the processor-specific operation responsive to the first and second opcodes, wherein the second opcode is to cause the second processor to send a memory access request to a memory controller coupled to a memory to obtain a second operand, perform an operation responsive to the second opcode on a first operand included in the data packet and the second operand to obtain a result, and store the result in the memory.
 15. The system of claim 14, wherein the link logic is to generate the data packet with a data portion having a size smaller than a size of a data portion according to the communication protocol and to set the second opcode to identify the smaller size data portion.
 16. The system of claim 14, wherein the second opcode is to cause the second processor to obtain data from the memory and to communicate the data to the first processor via a return data packet having a data portion with a size smaller than a size of a data portion according to the communication protocol.
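
By way of illustration only, and without limiting the claims, the flow recited above (a packet carrying a first opcode and a second opcode, a destination agent that obtains a second operand from a memory controller, performs the operation, stores the result, and returns a completion message) may be sketched as follows. Every name, opcode value, and size in this sketch is a hypothetical assumption: the claims do not specify encodings, an 8-byte operand, a 64-byte standard data portion, or addition as the atomic operation.

```python
from dataclasses import dataclass
from enum import IntEnum

# Hypothetical encodings and sizes, chosen only for this sketch.
PROTOCOL_DATA_BYTES = 64   # assumed standard data portion (one cache line)
SMALL_DATA_BYTES = 8       # assumed reduced data portion


class TxnType(IntEnum):
    """First opcode: transaction type of the data packet."""
    MEM_WRITE = 0x1


class ProcOp(IntEnum):
    """Second opcode: processor-specific operation."""
    NONE = 0x0
    ATOMIC_ADD = 0x2       # addition is an assumed example operation


@dataclass
class DataPacket:
    first_opcode: TxnType
    second_opcode: ProcOp
    address: int
    data: bytes            # data portion carrying the first operand


class MemoryController:
    """Toy memory controller backed by a dictionary."""

    def __init__(self):
        self.memory = {}

    def read(self, address):
        return self.memory.get(address, 0)

    def write(self, address, value):
        self.memory[address] = value


def generate_packet(address, first_operand):
    """Build a packet whose data portion is smaller than the standard size."""
    data = first_operand.to_bytes(SMALL_DATA_BYTES, "little")
    return DataPacket(TxnType.MEM_WRITE, ProcOp.ATOMIC_ADD, address, data)


def destination_agent_handle(packet, controller):
    """Perform the operation named by the second opcode, then acknowledge."""
    if packet.second_opcode is not ProcOp.ATOMIC_ADD:
        raise ValueError("unsupported processor-specific operation")
    first_operand = int.from_bytes(packet.data, "little")
    # Memory access request to obtain the second operand.
    second_operand = controller.read(packet.address)
    # Wrap to the operand width so the result fits the small data portion.
    result = (first_operand + second_operand) % (1 << (8 * SMALL_DATA_BYTES))
    controller.write(packet.address, result)
    return "CMP"           # hypothetical completion message
```

In this sketch the operation is performed in the destination agent; the variant of claim 4 would instead forward the first operand and the second opcode to a logic unit inside the memory controller.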